Homework 2

Introduction

In this study, the quarterly gasoline and diesel sales (in 1000 m3) of a major distributor between 2000 and 2006, and a number of potential factors dataset is given as the subject. Our initial aim is to inspect the relationships between these factors and sales, in addition to the real life determinants like seasonal effects and trends. After the analysis of the given variables, a model will be tried to fit to come up with predictions of the year 2007.

Visualization and Numerical Analysis

The first step to our research is manipulation of the dataset to fit the proper format for the time series regression modeling and have a general understanding of the variables by numerical analysis.
The variables provided as an input are given below: UGS: Unleaded gasoline sale in a given quarter, RNUV: An index indicating the rate of new unleaded gasoline using vehicles being added to the traffic in a quarter, PU: Average price (adjusted with an index) of a liter of unleaded gasoline in a quarter, PG: Average price (adjusted with an index) of a liter of diesel gasoline in a quarter, NUGV: Number of unleaded gasoline using vehicles in the traffic, NDGV: Number of diesel gasoline using vehicles in the traffic (per 1000 people), GNPA: Agriculture component of Gross National Product (adjusted with an index), GNPC: Commerce component of Gross National Product (adjusted with an index), GNP: Grand total for GNP (agriculture, commerce and other components total).

salesData <- read.csv("/Users/larahos/Desktop/HW2_data.csv")

colnames(salesData) <- c("Quarter","UGS","RNUV","NLPG","PU","PG","NUGV","NDGV","GNPA","GNPC","GNP")

for(i in c(2,4,7,9,10,11)) {
  salesData[,i] <- gsub( " ", "", salesData[,i])
  salesData[,i]=as.numeric(salesData[,i])
}

salesData[,"Quarter"]<-as.yearqtr(salesData[,"Quarter"],format="%Y _Q %q")

sales_data <- data.table(salesData)

salesDataForecast <- salesData[c(29:32),]
##    Quarter     UGS   RNUV   NLPG     PU     PG    NUGV     NDGV    GNPA    GNPC
## 1: 2000 Q1 1128971 0.0146 940000 469.03 355.69 4647500 281.9853 1040173 3483132
## 2: 2000 Q2 1199569 0.0205 941000 459.42 344.58 4742876 284.0813 1760460 4525451
## 3: 2000 Q3 1370167 0.0207 943500 439.98 327.21 4840931 286.7169 6974808 5915204
## 4: 2000 Q4 1127548 0.0163 948000 402.08 300.67 4919685 288.3137 3267125 4929778
## 5: 2001 Q1 1033918 0.0071 950000 411.58 305.75 4954754 287.6237 1004528 3418387
## 6: 2001 Q2 1019754 0.0051 955000 520.39 374.78 4980204 287.8814 1449357 4359831
##         GNP
## 1: 18022686
## 2: 21797130
## 3: 30050207
## 4: 24480153
## 5: 15832648
## 6: 20296918
##     Quarter          UGS               RNUV               NLPG        
##  Min.   :2000   Min.   : 736580   Min.   :0.001200   Min.   : 940000  
##  1st Qu.:2002   1st Qu.: 876210   1st Qu.:0.004100   1st Qu.: 997500  
##  Median :2004   Median : 983573   Median :0.007250   Median :1352000  
##  Mean   :2004   Mean   : 987212   Mean   :0.008906   Mean   :1299922  
##  3rd Qu.:2006   3rd Qu.:1097324   3rd Qu.:0.013150   3rd Qu.:1602000  
##  Max.   :2008   Max.   :1370167   Max.   :0.020700   Max.   :1797400  
##                 NA's   :4                                             
##        PU              PG             NUGV              NDGV      
##  Min.   :402.1   Min.   :300.7   Min.   :4647500   Min.   :282.0  
##  1st Qu.:479.2   1st Qu.:366.9   1st Qu.:5029281   1st Qu.:286.6  
##  Median :526.7   Median :401.6   Median :5186068   Median :291.9  
##  Mean   :519.2   Mean   :400.4   Mean   :5296758   Mean   :304.9  
##  3rd Qu.:565.2   3rd Qu.:449.2   3rd Qu.:5604709   3rd Qu.:322.9  
##  Max.   :609.0   Max.   :471.9   Max.   :6065597   Max.   :357.3  
##                                                                   
##       GNPA              GNPC              GNP          
##  Min.   : 832953   Min.   :3374849   Min.   :15832648  
##  1st Qu.:1348066   1st Qu.:4403081   1st Qu.:21790148  
##  Median :1850657   Median :5124300   Median :25229712  
##  Mean   :2895561   Mean   :5246534   Mean   :25745763  
##  3rd Qu.:3922950   3rd Qu.:5967433   3rd Qu.:29843404  
##  Max.   :7140722   Max.   :7480414   Max.   :36741745  
## 
## Classes 'data.table' and 'data.frame':   32 obs. of  11 variables:
##  $ Quarter: 'yearqtr' num  2000 Q1 2000 Q2 2000 Q3 2000 Q4 ...
##  $ UGS    : num  1128971 1199569 1370167 1127548 1033918 ...
##  $ RNUV   : num  0.0146 0.0205 0.0207 0.0163 0.0071 0.0051 0.0041 0.0048 0.0012 0.0032 ...
##  $ NLPG   : num  940000 941000 943500 948000 950000 ...
##  $ PU     : num  469 459 440 402 412 ...
##  $ PG     : num  356 345 327 301 306 ...
##  $ NUGV   : num  4647500 4742876 4840931 4919685 4954754 ...
##  $ NDGV   : num  282 284 287 288 288 ...
##  $ GNPA   : num  1040173 1760460 6974808 3267125 1004528 ...
##  $ GNPC   : num  3483132 4525451 5915204 4929778 3418387 ...
##  $ GNP    : num  18022686 21797130 30050207 24480153 15832648 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data Visualization

The graph and the smoothing lines are drawn below to visually inspect the dataset.

ggplot(sales_data, aes(x=Quarter, y=UGS)) +
  geom_line(color="turquoise4") +
  theme_minimal() + 
  labs(x="", y="Sales", title="Unleaded Gasoline Sales per Quarter 2000-2006") +
  theme(plot.title = element_text(hjust=0.5, size=20, face="bold"))+
  geom_smooth(formula = y ~ x, method = "lm")

The dataset shows a decreasing trend over time as seen in the graph above from 2000 to 2007. Furthermore, from the line it can be concluded that the variance is getting relatively smaller around 2003 and gets larger afterwards. To have a better understanding, the datatset is decomposed below:

datats <- ts(sales_data$UGS,freq=4,start=c(2000,1))
decom_data <- decompose(datats, "additive")
plot(decom_data)

When the time series is decomposed, it is seen that there is a significant decreasing trend over time. Also the seasonality effect is seen as having higher values of sales in Q3 each year. The dataset is not stationary. The aim of this project is basically after including seasonal and trend variables, explaining the random residuals in terms of other variables.

Plotting Autocorrelation

acf(sales_data$UGS[1:28])

As illustrated above, there is a significant lag at lag 1. This lag value is a sign of trend, each year being affected by the past years sales. There is a significant lag at lag 4 and relatively high lag at 8, which proves the fact of seasonal effects.

Model Building for Time Series Regression

As a conclusion of the visual and acf analysis of the dataset, trend and seasonality information is being added to the dataset, in order to use in following steps for model fitting and prediction.

#added trend information
sales_data[,trend:=1:.N]

#add quarter information to use in seasonality afterwards
Q=seq(1,4,by=1)
sales_data=cbind(sales_data,Q)

Trend and seasonality approaches to all dataset as a whole, but also lagged values can be a godd guide to catch autocorrelation and dependency betweeen UGS value. Therefore, we added lagged values Y(t-1) and Y(t-4) to the dataset.

#add lagged values of lag1 and lag4
sales_data$lag1 <- NA
sales_data$lag4 <- NA

sales_data$lag1 <- dplyr::lag(sales_data$UGS)
sales_data$lag4 <- dplyr::lag(sales_data$UGS, 4L)

Analysis of the Independent Variables

To understand which variables have a significant correlation between, it is logical to print the output by using ggpairs.

ggpairs(sales_data[,-1])

There is a significant correlation between UGS and NLPG, NUGV, GNPA, NDGV; also trend and lag 4 values as expected. PU and PG values are highly correlated with each other. GNPA, GNPC and GNP values are also have a high correlation. The main goal is to include variables that effects UGS the most but have a low correlation internally.

Model Fitting

The model will be built step by step, checking significance of variables in each step and adding them in a logical order. Our main goal is to increase adjusted R-squared value while satisfying the assumption of random, independent errors with mean 0 and constant variance.

model1 <- lm(sales_data$UGS~trend, data = sales_data)
summary(model1)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend, data = sales_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199945  -73550  -21904   71224  237369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1170777      43292  27.043  < 2e-16 ***
## trend         -12660       2608  -4.854 4.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111500 on 26 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.4754, Adjusted R-squared:  0.4552 
## F-statistic: 23.56 on 1 and 26 DF,  p-value: 4.945e-05
checkresiduals(model1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 19.942, df = 6, p-value = 0.002836

The initial model has started with the obvious trend, it has a low R squared value and also autocorrelation is significantly high. To overcome the lag effect, two methods will be tried; lag 4 and seasonality. The one with the higher R-squared value will be chosen to move on for the further models.

model2 <- lm(sales_data$UGS~trend+lag4, data = sales_data)
summary(model2)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + lag4, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66549 -42583   -989  47813  81027 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.371e+04  1.178e+05   0.710    0.485    
## trend       2.084e+03  1.960e+03   1.063    0.300    
## lag4        8.254e-01  9.262e-02   8.912  1.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49740 on 21 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8525, Adjusted R-squared:  0.8385 
## F-statistic: 60.69 on 2 and 21 DF,  p-value: 1.87e-09
checkresiduals(model2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 2.5252, df = 6, p-value = 0.8656
model3 <- lm(sales_data$UGS~trend+as.factor(Q), data = sales_data)
summary(model3)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q), data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81167 -31283  -3458  28640  94502 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1060372      23653  44.830  < 2e-16 ***
## trend           -13497       1147 -11.764 3.28e-11 ***
## as.factor(Q)2   121532      25987   4.677 0.000104 ***
## as.factor(Q)3   273618      26063  10.498 3.03e-10 ***
## as.factor(Q)4    95049      26189   3.629 0.001405 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48570 on 23 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9119, Adjusted R-squared:  0.8966 
## F-statistic: 59.53 on 4 and 23 DF,  p-value: 8.446e-12
checkresiduals(model3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals
## LM test = 11.462, df = 8, p-value = 0.1769

Even though lag 4 value has dealt with autocorrelation better, it reduced the effect of trend and has a lower R squared value. Therefore factorized seasonal values are chosen to move forward.

model4 <- lm(sales_data$UGS~trend+as.factor(Q)+ NLPG+PU+PG+NUGV+NDGV+GNPA , data = sales_data)
summary(model4)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NLPG + PU + 
##     PG + NUGV + NDGV + GNPA, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67794 -10461   1987  18461  32208 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.094e+06  8.283e+05   3.735  0.00165 ** 
## trend          1.514e+04  1.284e+04   1.178  0.25486    
## as.factor(Q)2  1.463e+05  2.356e+04   6.212 9.48e-06 ***
## as.factor(Q)3  4.953e+05  1.597e+05   3.102  0.00648 ** 
## as.factor(Q)4  1.607e+05  4.385e+04   3.664  0.00192 ** 
## NLPG          -2.937e-01  1.804e-01  -1.628  0.12195    
## PU             2.276e+01  1.017e+03   0.022  0.98241    
## PG            -1.129e+03  1.437e+03  -0.786  0.44295    
## NUGV          -1.023e+00  3.343e-01  -3.061  0.00708 ** 
## NDGV           1.238e+04  3.581e+03   3.456  0.00302 ** 
## GNPA          -3.613e-02  2.764e-02  -1.307  0.20854    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29700 on 17 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9613 
## F-statistic: 68.12 on 10 and 17 DF,  p-value: 1.055e-11
checkresiduals(model4)

## 
##  Breusch-Godfrey test for serial correlation of order up to 14
## 
## data:  Residuals
## LM test = 26.623, df = 14, p-value = 0.02154

The variables which showed a high correlation with UGS values are included in the model. As seen in the summary, GNPA, PU and PG do not have a significant effect, therefore they will be eliminated. The main reason behind this could be related to the fact that correlation does not indicate to causality. Even though they are eliminated in this context, that only implies that they are insignificant when they are included as this set of variables. Furthermore, our trend lost its significance, however, it will be held in the model in the following steps. It may have lost its significance due to the fact that decreasing trend is demonstrated by many other independent variables in the model, so it is safer to hold it in at least one more step. Also the model has a high lag 1 value.

model5 <- lm(sales_data$UGS~trend+as.factor(Q)+ NUGV +NDGV, data = sales_data)
summary(model5)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NUGV + NDGV, 
##     data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67804 -17414   3897  19110  59734 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.765e+06  4.825e+05   3.658 0.001467 ** 
## trend         -6.596e+03  3.908e+03  -1.688 0.106254    
## as.factor(Q)2  1.242e+05  1.898e+04   6.543 1.76e-06 ***
## as.factor(Q)3  2.798e+05  1.917e+04  14.592 1.83e-12 ***
## as.factor(Q)4  1.071e+05  1.979e+04   5.412 2.28e-05 ***
## NUGV          -5.356e-01  1.590e-01  -3.369 0.002899 ** 
## NDGV           6.619e+03  1.421e+03   4.657 0.000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35420 on 21 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9572, Adjusted R-squared:  0.945 
## F-statistic: 78.32 on 6 and 21 DF,  p-value: 2.816e-13
checkresiduals(model5)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 14.184, df = 10, p-value = 0.1647

When we eliminated the insignificant variables, the R squared value has decreased, so we will try to add each eliminated variables one by one to catch the one resulting in the best improvement.

model6 <- lm(sales_data$UGS~trend+as.factor(Q)+ NUGV + NDGV + PU, data = sales_data)
summary(model6)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NUGV + NDGV + 
##     PU, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60791 -13274    756  16273  55213 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.869e+06  4.103e+05   4.554 0.000193 ***
## trend         -4.941e+03  3.356e+03  -1.472 0.156576    
## as.factor(Q)2  1.313e+05  1.626e+04   8.077 1.00e-07 ***
## as.factor(Q)3  2.890e+05  1.653e+04  17.484 1.38e-13 ***
## as.factor(Q)4  1.032e+05  1.682e+04   6.132 5.43e-06 ***
## NUGV          -5.136e-01  1.349e-01  -3.807 0.001105 ** 
## NDGV           6.680e+03  1.205e+03   5.545 1.99e-05 ***
## PU            -5.142e+02  1.692e+02  -3.039 0.006474 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30020 on 20 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9707, Adjusted R-squared:  0.9605 
## F-statistic: 94.79 on 7 and 20 DF,  p-value: 6.1e-14
checkresiduals(model6)

## 
##  Breusch-Godfrey test for serial correlation of order up to 11
## 
## data:  Residuals
## LM test = 20.192, df = 11, p-value = 0.04278

PU resulted in a good improvement in the model, so it is also included in the model. Lag 1 will tried to be dealt in the following steps.

model7 <- lm(sales_data$UGS~trend+as.factor(Q)+  NDGV +  PU + NUGV + lag1, data = sales_data)
summary(model7)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU + 
##     NUGV + lag1, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44528 -10996  -4285  15377  39939 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.796e+06  6.050e+05   4.621 0.000212 ***
## trend         -8.559e+03  3.562e+03  -2.403 0.027253 *  
## as.factor(Q)2  9.041e+04  2.366e+04   3.821 0.001252 ** 
## as.factor(Q)3  3.099e+05  1.729e+04  17.924 6.33e-13 ***
## as.factor(Q)4  1.985e+05  4.069e+04   4.878 0.000121 ***
## NDGV           9.724e+03  2.033e+03   4.782 0.000149 ***
## PU            -6.308e+02  1.628e+02  -3.876 0.001108 ** 
## NUGV          -7.551e-01  2.011e-01  -3.755 0.001450 ** 
## lag1          -4.865e-01  1.947e-01  -2.499 0.022362 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27090 on 18 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9679 
## F-statistic: 99.11 on 8 and 18 DF,  p-value: 2.71e-13
checkresiduals(model7)

## 
##  Breusch-Godfrey test for serial correlation of order up to 12
## 
## data:  Residuals
## LM test = 20.476, df = 12, p-value = 0.05859

When lag 1 is included, all variables have significance which is a good sign for the validity of the model. However, these much variable pushed the lag 4 higher and adding lag 4 decreased R squared a lot. So we will try some alternative methods to search for a better conclusion.

sales_data$NUGVlag1 <- dplyr::lag(sales_data$NUGV, 1L)
sales_data$NDGVlag4 <- dplyr::lag(sales_data$NDGV, 4L)
sales_data$PUlag4 <- dplyr::lag(sales_data$PU, 4L)

sales_data$NLPGlag4 <- dplyr::lag(sales_data$NLPG, 4L)
sales_data$RNUVlag4 <- dplyr::lag(sales_data$RNUV, 4L)
sales_data$PUlag1 <- dplyr::lag(sales_data$PU, 1L)

model8 <- lm(sales_data$UGS~trend+as.factor(Q)+  NDGV  +  PU + lag1, data = sales_data)
summary(model8)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU + 
##     lag1, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74477 -15560   3068  24018  44993 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.642e+05  2.721e+05   2.441  0.02463 *  
## trend         -1.508e+04  4.041e+03  -3.731  0.00141 ** 
## as.factor(Q)2  1.398e+05  2.557e+04   5.466 2.84e-05 ***
## as.factor(Q)3  2.991e+05  2.216e+04  13.498 3.47e-11 ***
## as.factor(Q)4  1.089e+05  4.284e+04   2.542  0.01990 *  
## NDGV           2.624e+03  9.712e+02   2.701  0.01415 *  
## PU            -6.498e+02  2.115e+02  -3.073  0.00626 ** 
## lag1          -4.920e-02  2.028e-01  -0.243  0.81090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35210 on 19 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9604, Adjusted R-squared:  0.9458 
## F-statistic: 65.85 on 7 and 19 DF,  p-value: 5.513e-12
checkresiduals(model8)

## 
##  Breusch-Godfrey test for serial correlation of order up to 11
## 
## data:  Residuals
## LM test = 12.914, df = 11, p-value = 0.299
model9 <- lm(sales_data$UGS~trend+as.factor(Q)+  NDGV  +  PU + NUGVlag1 +lag1, data = sales_data)
summary(model9)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU + 
##     NUGVlag1 + lag1, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46810 -15769  -4205  15933  45342 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.462e+06  5.321e+05   4.627 0.000209 ***
## trend         -1.183e+04  3.258e+03  -3.631 0.001911 ** 
## as.factor(Q)2  8.401e+04  2.497e+04   3.365 0.003449 ** 
## as.factor(Q)3  3.027e+05  1.722e+04  17.574 8.87e-13 ***
## as.factor(Q)4  1.834e+05  3.892e+04   4.712 0.000174 ***
## NDGV           6.726e+03  1.345e+03   4.999 9.30e-05 ***
## PU            -6.872e+02  1.644e+02  -4.180 0.000563 ***
## NUGVlag1      -5.043e-01  1.370e-01  -3.681 0.001709 ** 
## lag1          -5.021e-01  1.998e-01  -2.514 0.021683 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27330 on 18 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9674 
## F-statistic: 97.38 on 8 and 18 DF,  p-value: 3.162e-13
checkresiduals(model9)

## 
##  Breusch-Godfrey test for serial correlation of order up to 12
## 
## data:  Residuals
## LM test = 24.932, df = 12, p-value = 0.01515

Different lagged values of variables are tried for pulling autocorrelation below the upper limits. Lagged value of NUGV resulted in a good model, with all regressors being significant variables, relatively low lag4 value and more random looking error.

Evaluation of Final Model

Model 9 is chosen to be our final model with significant variables, good R squared value and errors. Below the plots to prove the validity is printed:

finalmodel <- model9
summary(model9)
## 
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU + 
##     NUGVlag1 + lag1, data = sales_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46810 -15769  -4205  15933  45342 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.462e+06  5.321e+05   4.627 0.000209 ***
## trend         -1.183e+04  3.258e+03  -3.631 0.001911 ** 
## as.factor(Q)2  8.401e+04  2.497e+04   3.365 0.003449 ** 
## as.factor(Q)3  3.027e+05  1.722e+04  17.574 8.87e-13 ***
## as.factor(Q)4  1.834e+05  3.892e+04   4.712 0.000174 ***
## NDGV           6.726e+03  1.345e+03   4.999 9.30e-05 ***
## PU            -6.872e+02  1.644e+02  -4.180 0.000563 ***
## NUGVlag1      -5.043e-01  1.370e-01  -3.681 0.001709 ** 
## lag1          -5.021e-01  1.998e-01  -2.514 0.021683 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27330 on 18 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9674 
## F-statistic: 97.38 on 8 and 18 DF,  p-value: 3.162e-13
plot(model9)

Residuals vs Fitted plot is almost a straight line, especially for the high values of UGS. Q-Q plot fits the line in most of the points which proves the normal distribution assumption of errors.

Forecasts of UGS Values 2007

First, the final model will predict the given dataset of first 28 Quarters, which will be represented on the plot below. Since they move together mostly and it gmade a good job catching the peaks and the trend, the model will be used for predicting the last 4 values.

final_plot <- sales_data
final_plot[,actual:=UGS]
final_plot[,predicted:=predict(finalmodel,final_plot)]

ggplot(final_plot ,aes(x=Quarter)) +
  geom_line(aes(y=UGS,color='actual')) + 
  geom_line(aes(y=predicted,color='predicted'))+
  labs(title = "Actual vs. Fitted UGS", x = "Quarter", y = "UGS (in 1000 m3)")
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Removed 4 row(s) containing missing values (geom_path).

Forecasts for the last 4 UGS values are given below:

prediction_set = sales_data[(29:32),c("UGS","NDGV","lag1","NUGVlag1","PU", "trend", "Q")]
for(i in 1:4) {
  prediction_set[i,1] = predict(finalmodel,newdata = prediction_set[i,])
  if(i<4){
    prediction_set[i+1,"lag1"] = prediction_set[i,1] 
  }
}

prediction_set
##         UGS     NDGV     lag1 NUGVlag1     PU trend Q
## 1: 656222.1 342.1729 872000.0  5825866 565.19    29 1
## 2: 847051.5 346.9407 656222.1  5869018 565.19    30 2
## 3: 956963.8 351.4449 847051.5  5931348 565.19    31 3
## 4: 779731.7 357.2902 956963.8  5991280 565.19    32 4

Conclusion

In this homework, initially a dataset is analyzed whether it is stationary or not. Then the variables are inspected, several models tried and the one with the highest R squared and better in error distribution is chosen to predict last 4 values.